[spark] Expose Paimon metrics via JMX for Prometheus scraping#7647
[spark] Expose Paimon metrics via JMX for Prometheus scraping#7647junmuz wants to merge 2 commits into
Conversation
f583df8 to
3261073
Compare
|
@JingsongLi @Zouxxyy Can I get a review on the PR? This change allows scraping metrics from Prometheus. |
|
@YannByron @Zouxxyy I see you have been pretty active with the Spark - Paimon connector. Can I get a review on the PR for exposing Paimon metrics? |
9cc14b2 to
62b13d2
Compare
62b13d2 to
9859bd6
Compare
JingsongLi
left a comment
There was a problem hiding this comment.
Thanks for adding metrics observability to the Spark integration!
Architecture concerns:
-
PaimonMetricsSourcesingleton with its ownJmxReporter: Bypassing Spark's MetricsSystem snapshot limitation is pragmatic, but running a separate JmxReporter means Paimon metrics appear under a different MBean domain than Spark's own metrics. Document this clearly — operators using jmx_prometheus_javaagent need to know which domain to scrape. -
Thread safety of
SparkMetricGroup: TheregisterCodahaleGaugemethod doesremove+register. WhileMetricRegistryis internally thread-safe, the two-step operation has a race window where another thread could register between remove and register. In practice this is unlikely since commit is single-threaded per executor, but worth noting. -
Gauge semantics: Registering counters as Codahale Gauges (via
paimonCounter::getCount) works but loses the "counter" semantic. Monitoring tools that expect monotonically increasing counters won't get properrate()calculations from a Gauge. Consider usingcodahaleRegistry.counter()for Paimon counters instead. -
Memory lifecycle: The
SparkMetricGroup.close()removes registered names from the registry. But ifclose()is never called (e.g., task failure), the stale gauges remain in the registry pointing to deallocated state. TheregisterCodahaleGaugeremove-before-register pattern helps with subsequent commits, but for multi-table scenarios, old gauges could accumulate. -
Metric naming:
paimon.{table}.{groupName}.{metricName}— the table name may contain dots or special chars. Thesanitize()method replaces non-alphanumeric chars with_, which is good, but consider also handling empty table names gracefully.
Overall the approach is sound for V1. Please address the counter vs gauge semantic issue.
Purpose
Bridge Paimon commit metrics into Spark's MetricsSystem using a Codahale MetricRegistry so they are accessible as JMX MBeans. This enables jmx_prometheus_javaagent to scrape Paimon metrics (commit duration, files added/deleted, records appended, partitions written, etc.).
Tests